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Abstract 

Numerical quantum transport calculations are commonly based on a tight-binding formulation. 
A wide class of quantum transport algorithms requires the tight-binding Hamiltonian to be in 
the form of a block-tridiagonal matrix. Here, we develop a matrix reordering algorithm based 
on graph partitioning techniques that yields the optimal block-tridiagonal form for quantum 
transport. The reordered Hamiltonian can lead to significant performance gains in transport 
calculations, and allows to apply conventional two-terminal algorithms to arbitrary complex ge- 
ometries, including multi-terminal structures. The block-tridiagonalization algorithm can thus 
be the foundation for a generic quantum transport code, applicable to arbitrary tight-binding 
systems. We demonstrate the power of this approach by applying the block-tridiagonalization 
algorithm together with the recursive Green's function algorithm to various examples of meso- 
scopic transport in two-dimensional electron gases in semiconductors and graphene. 
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1. Introduction 

If the dimensions of a device become smaller than the phase coherence length 1$ of 
charge carriers, classical transport theories are not valid any more. Instead, carrier dy- 
namics is now governed by quantum mechanics, and the wave-like nature of particles 
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becomes important. In general, the conductance/resistance of such a device does not 
follow Ohm's law. 

In the regime of coherent quantum transport, the Landauer-Buttiker formalism [HULI!! 
relates the conductance G of a device to the total transmission probability T of charge 
carriers through the device, 
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mn 

where t mn is the transmission amplitude between different states with transverse quan- 
tum numbers n and m in the left and right lead, respectively. A state with a given 
transverse quantum number n is also called channel n. 

The problem of calculating the conductance is thus reduced to calculating scattering 
eigenfunctions ip for a given energy E: 

(E-H)il> = 0, (2) 

where H is the Hamiltonian of the system. Alternatively, the transmission probability can 
be extracted from the retarded Green's function G r that obeys the equation of motion 

(E - H)G r = 1 . (3) 

The Fisher-Lee relation 0, then allows to calculate the transmission (t mn ) and reflec- 
tion (r nm ) amplitudes from G r . In its simplest form, the Fisher-Lee relation reads 

t mn = -ifh/v m v n / dy dy'<j> m (y) G R (x, x') <j> n (y') , (4) 

« Cr "'Cl 

and 

r mn = S mn - ih^VmVn \ dy dy' 4>m(y) G R (x, x') <fi n {y') , (5) 
JCl JCl 

where v n is the velocity of channel n and the integration runs over the cross-section Cl 
(C r ) of the left (right) lead. 

The Landauer-Buttiker formalism can also deal with multi-terminal systems, but is 
restricted to linear response, i.e. small bias voltages. In the general case including exter- 
nal bias, the conductance can be calculated using the non-equilibrium Green's function 
formalism (see, e.g. @). 

Except for particularly simple examples, solving Eqs. ((2]) and ([3]) exactly is not pos- 
sible, and therefore a numerical computation is often the method of choice. Instead of 
solving directly a differential equation with its continuous degrees of freedom, such as the 
Schrodinger equation, numerical computations are usually only attempted within a dis- 
crete basis set. The differential equation is then replaced by a set of linear equations, and 
the Hamiltonian H can be written as a matrix. Very often, only few of the matrix elements 
Hij are nonzero. Such tight-binding representations of the Hamiltonian are ubiquitous in 
quantum transport calculations and can arise from finite differences 0, HS, from the 
finite element method from atomic orbitals in empirical tight-binding [111 12 . 13[ or 
Kohn-Sham orbitals within density functional theory [14, T^, lq |. 

When describing transport, the systems under consideration are open and thus extend 
to infinity. As a consequence, the tight-binding matrix H is infinite-dimensional. However, 
the conductance calculation can be reduced to a finite problem by partitioning the system 
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Fig. 1. (a) Schematic view of a finite difference grid in a two-terminal transport setup, (b) Natural 
ordering of grid points yielding a block-tridiagonal matrix structure. The different matrix blocks are 
marked in alternating shades of grey. 



into a finite scattering region attached to leads that extend to infinity, as schematically 
depicted in Fig. [TJa). For the case of two-terminals, the matrix H can be written as 



H 



1 H L V LS N 
v V RS H K j 



(0) 



where -ffL(R) is the (infinite) Hamiltonian of the left (right) lead, Hg is the Hamiltonian of 
the scattering region and of finite size. The matrices Vsl = V^jg an< ^ ^SR = V^.g represent 
the coupling between the scattering region and the left and right lead, respectively. 

In order to reduce the problem size, it is useful to introduce the retarded self-energy 
S r = J2i=L r Vsi 9i Vis, where is the surface Green's function of the left (right) 

lead, i.e. the value of the Green's function at the interface of the lead disconnected from 
the scatterin g region . Then, the Green's function Gs of the scattering region can be 



calculated as [17j, Il8l ] 



G s = (£-ff s -£ r r 



(7) 



reminiscent of Eq. ([3]) but with an effective Hamiltonian H$ + E r of finite size. This 
treatment is easily extended to multi-terminal systems. 

Note that it suffices to know the surface Green's function of the (semi-)infinite leads, as 
in a tight-binding Hamiltonian the matrices Vsl and Vsr have only few nonzero entries. 
For simple systems, the surface Green's function can be calculated analytically \vH [l8j]. 
whereas in more complex situations it must be computed numerically, either by iteration 
1^ . 2fj| , or by semi-analytical formulas [12, 3, 21 \ ■ 

The original infinite-dimensional problem has thus been reduced to a finite size matrix 
problem that can, in principle, be solved straight-forwardly on a computer. However, for 
any but rather small problems, the computational task of the direct inversion in Eq. ([7]) 
is prohibitive. Therefore, for two-terminal transport, many algorithms make use of the 
sparsity of the Hamiltonian matrix in tight-binding representation - in particular that 
this matrix can be written in block-tridiagonal form: 
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(8) 

where the index L (R) denotes the blocks in the left (right) lead, 1 ... N the blocks within 
the scattering region, and (N + 1) the first block in the left (right) lead. Such a form 
arises, for example, naturally in the method of finite differences, when grid points are 
grouped into vertical slices according to their ^-coordinates, as shown in Fig. [ljb), but 
also applies to any other sparse tight-binding Hamiltonian. 

The block-tridiagonal form of the Hamiltonian is the foundation of several quantum 
transport algorithms for two-terminal systems. The transfer matrix approach applies 
naturally to block-tridiagonal Hamiltonians, but becomes unstable for larger systems. 
However, a stabilized version has been developed by Usuki et al. 22|, |23[. In the decima- 
tion technique (HiUll, the Hamiltonian of the scattering region is replaced by an effective 
Hamiltonian between the two leads by eliminating internal degrees of freedom. The con- 
tact block reduction method [26[ calculates the full Green's function of the system using 
a limited set of eigenstates. The recursive Green's function (RGF) technique 27, 28|, 29| 
uses Dyson's equation to build up the system's Green's function block by block. It has 
also been adapted to H all g eometries with four terminals [3(| and to calculate non- 
equilibrium densities Furthermore, the RGF algorithm has been formulated to 
be suitable for parallel computing 33]. 

Of course, there are also other transport techniques not directly based on the block- 
tridiagonal form of the Hamiltonian matrix, such as extracting the Green's function from 
wave packet dynamics [34| ■ Still, such algorithms are not as widely used as the large class 
of algorithms, that are directly based on the block-tridiagonal form of the Hamiltonian. 
In order to illustrate the typical computational tasks of this class of algorithms, we briefly 
explain, as a representative example, the RGF algorithm. 

The RGF technique is based on Dyson's equation G r = G r Q + G T VG r (see, e.g @), 
where G r denotes the Green's function of the perturbed system, G T that of the unper- 
turbed system and V the perturbation. Using this equation, the system is built up block 
by block, as depicted in Fig. [51 Let G T '( 1 ' denote the Green's function for the system 
containing all blocks > i. Then, at energy E, the Green's function G r >( i_1 ) is related to 
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Fig. 2. Schematic depiction of the recursive Green's function algorithm: (a) The Green's function G r 'M 
contains all blocks > i. (b) The Green's function G r, ' l_1 -' is obtained by adding another matrix block. 
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Starting from G^^ 4 ^^ = g^, the surface Green's function of the right lead, ./V slices 

are added recursively, until G T '^ 1 ' has been calculated. The blocks of the Green's function 
of the full system necessary for transport are then given by 



^o,o — 



(oar 
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r iV+l,0 

where of is the surface Green's function of the left lead 



(11) 
(12) 

3l is tne surtace wreen s function ot tne lett lead. Uq and G 1 N+1 arc sufficient 
to calculate transmission and reflection probabilities via the Fisher-Lee relation, Eqs. (|4]) 
and (©. 

Each step of the algorithm performs inversions and matrix multiplications with matri- 
ces of size Mi. Since the computational complexity of matrix inversion and multiplications 
scales as Mf, the complexity of the RGF algorithm is oc 

ES 1 M i- Thus > it scales lin " 
early with the "length" N, and cubically with the "width" Mj of the system. This scaling 

also applies to most of the other transport algorithms mentioned above. 

While for particular cases general transport algorithms, such as the RGF algorithm, 

cannot compete with more specialized algorithms, such as the modular recursive Green's 



function technique 35, 36J that is optimized for special geometries, they are very versatile 
and easily adapted to two-terminal geometries — provided that the leads are arranged 
collincarly. Amongst other things, this restriction will be lifted by the approach presented 
in this work. 

Although the block-tridiagonal structure of H, Eq. |8|), that arises naturally in many 
problems appears to have a small "width" and thus seems to be quite suitable for trans- 
port algorithms, optimizing the block-tridiagonal structure by reordering the matrix H 
may lead to a significant speed-up in the conventional two-terminal algorithms, as we 
show below. Furthermore, such a reordering allows for the application of the estab- 
lished two-terminal algorithms to more complex geometries, such as non-collinear leads 
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or multi-terminal structures, that would otherwise need the development of specialized 
algorithms. 

Below, we develop a matrix reordering algorithm based on graph partitioning tech- 
niques that brings an arbitrary matrix H into a block-tridiagonal form optimized for 
transport calculations. To this end, the paper is organized as follows. In section [2] we 
formulate the matrix reordering problem in the language of graph theory and develop 
the reordering algorithm. In section [3] we apply this algorithm to various examples and 
investigate its performance and the performance of the RGF algorithm for the reordered 
Hamiltonian H. We conclude in section |U 



2. Optimal Block-tridiagonalization of matrices 

2.1. Definition of the problem 

2.1.1. Definition of the matrix reordering problem 

As shown in the introduction, the typical runtime of transport algorithms, is propor- 
tional to ^^Jq 1 Mf, does depend on the particular block-tridiagonal structure of H. 
Therefore, the runtime of these algorithms can be improved in principle by conveniently 
reordering H with a permutation P, 

H^PHP- 1 . (13) 

In order to quantify how a typical transport algorithm performs for a given matrix 
structure, we define a weight w{H) associated with a matrix H as 

JV+l 

w(H) = ^ , ( 14 ) 

where Mi is the size of block Hi^. Optimizing the matrix for transport algorithms is then 
equivalent to minimizing the weight w(H). Since X^o* ^ = -^gridj where -/V gr id is the 
total number of grid points, w(H) is minimal, if all Mi are equal, Mi = N gI [d/(N + 2). 
Therefore, a matrix tends to have small weight, if the number N of blocks is large, and 
all blocks are equally sized. The reordering problem of the matrix H is thus summarized 
as follows: 

Problem 2.1. Matrix reordering problem: Find a reordered matrix H' such, that 

(i) Hq and H' N+1N+1 are blocks given by the left and right leads (as required by 
transport algorithms) 

(ii) H' is block-tridiagonal (ify ^ 0, iff j = i + 1, i, i — 1), 

(iii) the number N of blocks is as large as possible, and all blocks are equally sized. 

In principle, this constrained optimization problem could be solved by generic opti- 
mization algorithms, such as Simulated Annealing. However, for larger problems the opti- 
mization could take much more time than the actual transport calculation, rendering the 
optimization process useless. It is therefore necessary to use heuristics especially designed 
for the problem at hand. To this end, we formulate the matrix reordering problem in the 
language of graph theory. 
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Fig. 3. (a) Simple example showing the connection between a graph, its graphical representations with 
dots and lines, and the zero-nonzero structure of a matrix, (b) Example of a finite difference grid, that 
can be interpreted as a graph, and the structure of the corresponding matrix. Nonzero entries are shown 
as black dots. 

2.1.2. Mapping onto a graph partitioning problem 

A graph Q is an ordered pair Q = (V,£ ), where V is a set of vertices v and £ a set of 
ordered pairs of vertices (vi, v%) G V x V. Such a pair is called an edge. A graph is called 
undirected, if for every edge {v\,V2) G £ also {v2,v\) £ £ ■ Two vertices v\ and t>2 are 
called adjacent, if {v\,V2) G £. In order to simplify the notation, we will also consider a 
vertex v to be adjacent to itself. 

There is a natural one-to-one correspondence between graphs and the structure of 
sparse matrices. For a given n x n matrix H, we define a graph Q = (V,£) with V = 
{1, . . . , n} and (i,j) G £ iff the entry htj ^ 0. A graph thus stores information about the 
structure of a matrix, i.e. which entries are nonzero. It does not contain any information 
about the values of the respective entries, although these may be stored easily along with 
the graph. However, for the formulation of the quantum transport algorithms, only the 
block-tridiagonal form, i.e. the structure of the matrix, is relevant. Hermitian matrices, 
that are considered in quantum transport, have a symmetric structure of zero and nonzero 
entries, and therefore the corresponding graphs are undirected. 

A graph can be depicted by drawing dots for each vertex v, and lines connecting 
these dots for every edge {v\,V2), as shown in Fig. [3ja) . It should be noted that a 
graphical representation of a tight-binding grid, such as shown in Fig.[3fb), can be directly 
interpreted as a representation of a graph and the corresponding matrix structure. 

In terms of graph theory, matrix reordering corresponds to renumbering the vertices of 
a graph. Since we are only interested in reordering the matrix in terms of matrix blocks 
(the order within a block should not matter too much), we define a partitioning of Q as 
a set {Vi} of disjoint subsets Vj C V such that \J i Vj = V and Vi fl Vj = for % ^ j. Using 
these concepts, we can now reformulate the original matrix reordering problem into a 
graph partitioning problem: 

Problem 2.2. Graph partitioning problem: Find a partitioning {Vo, • • ■ , Vn+i} of 
Q such that: 
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(i) Vo and Vjv+i contain the vertices belonging to left and right leads, 

(ii) (a) vertices in Vo and Vtv+i are only connected to vertices in Vi and Vat, respec- 

tively, 

(b) for < i < N + 1, there are edges between Vi and Vj iff j = i + 1, i,i — 1, 

(iii) the number N + 2 of sets Vi is as large as possible, and all sets Vi have the same 
cardinality |Vj|. A partitioning with all |Vi| equally sized is called balanced. 

A partitioning obeying requirement 12 . 2liH is called a level set with levels Vi [37[- Level 
sets appear commonly as an intermediate step in algorithms for bandwidth reduction of 
matrices [13, [HI HH, Eoj . These algorithms seek to find a level set of minimal width, i. c. 
maxi = o...jv+i |V«| as small as possible which is equivalent to requirement 12 . 2liiil The main 
difference between our graph partitioning problem and the bandwidth reduction problem 
is requirement 12 . 2ffl In the graph partitioning problem, Vo and Vat are determined by the 
problem at hand, while in the bandwidth reduction problem these can be chosen freely. 
Due to this difference, bandwidth reduction algorithms can be applied successfully to our 
graph partitioning problem only for special cases, as we show below. 

The term graph partitioning usually refers to the general problem of finding a balanced 
partitioning {V^} of a graph and has many applications in various fields such as very-large- 
scale integration (VLSI) design (JJ, Sj| , sparse matrix reorderings for LU or Cholesky 
decom pos itions [3] , or block ordering of sparse matrices for parallel computation [45|, 14a . 
I47I I4R )49l . [50( 1 . In particular, the latter examples also include block-tridiagonal orderings 
[4a , l47j |. However, as these reorderings are geared towards parallel computation, they 
obtain a fixed number N of sets Vi given by the number of processors of a parallel 
computer, whereas in our block-tridiagonal reordering the number N should be as large 
as possible. In addition to that, the constraints on the blocks Vo and V/v+i (requirement 
I2.2li|) are again not present there. 

As we cannot directly employ existing techniques to solve the graph partitioning prob- 
lem, we will develop an algorithm combining ideas from both bandwidth reduction and 
graph partitioning techniques in the subsequent sections: Concepts from bandwidth re- 
duction are employed to construct a level set which is then balanced using concepts from 
graph partitioning. 



2.2. Matrix reordering by graph partitioning 

2.2.1. A local approach — breadth first search 

A breadth-first-search (BFS) [HI] on a graph immediately yields a level set 37, 3^, 3^, 
lifll ]. In our particular example, the level set is constructed as follows: 
Algorithm 1. Level set construction by breadth-first-search. 

A Start from i = 0. Then, Vi — Vo, as the first level is given by the constraints of 
requirement ((j}. 

B If there is a vertex in Vi that is adjacent to a vertex in Vn+i, assign all the remaining 

unassigned vertices into Vi and end the algorithm. 
C All vertices adjacent to Vi that are not contained in the previous levels Vi, Vi-i, . . . Vo 

are assigned to V»+i. 
D Continue at step B with i = i + 1. 

Note that the sets {Vi} form a level set by construction — a set Vi may only have vertices 
adjacent to V^_i and Vj+i. The construction by BFS not only obtains the number of levels 
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Fig. 4. Level set created by a BFS starting from Vo- Different levels are shown in alternating shades of 
grey. 

N + 2 for a particular realization, but yields a more general information: 
Lemma 2.3. The number of levels N + 2 in the level set constructed by algorithm]]] is 
the maximum number of levels compatible with the constraints on the initial and final 
level Vo and Vn+i for a graph Q . 

This can be seen from the fact that a BFS finds the shortest path in the graph between 
the initial sets Vo and Vjv+ij (vq, Vi, ■ ■ ■ > Vi, . . . , vn+i) where vq € Vo and vn+i <E Vjv+i- 
Any vertex on this shortest path can be uniquely assigned to a single level V; and it 
would not be compatible with a larger number of levels than N + 2. 

Algorithm [T] not only yields the maximum number of levels: All vertices contained in 
the first n levels of the BFS must be contained in the first n levels of any other level set. 
Lemma 2.4. Let {Vo, Vi, . . . , Vn+i} be a level set constructed by algorithm]]^ and 
{Vq, V(, . . . , V' N , +1 } another level set consistent with the requirements of problem \2.S\ with 
N' < N. Then V U Vi U • • • U V„ C V U V( U • • • U V' n for Q < n < N' + 1. 

The statement is proved by induction. It is true trivially for n = (because of re- 
quirement [j] in problem I2.2[) and for n = N' + 1 (then the levels cover the whole graph) . 
Suppose now that the statement holds for n < N'. Note that for the proof it suffices 
to show that V n +i C Vg U V{ U . . . ,UV^+i- Consider now the set of all vertices adja- 
cent to V„, adjacent(V„) = {v € V | v is adjacent to some v' E V„}. By construction, 
V„ + i C adjacent(V„). Since V„ C V U V[ U • ■ ■ U V' n and V- is a level set, all ver- 
tices adjacent to V n must be contained in the set of vertices including the next level, 
i.e. adjacent(V„) C V UViU. . . , UV^ U V' n+1 . But then also V n+ i C V UViU. . . ,UV^ +1 , 
which concludes the proof. 

Thus, the vertices contained in the first n levels of the BFS form a minimal set of 
vertices needed to construct n levels. However, this also implies that the last level which 
then covers the remaining vertices of the graph, may contain many more vertices than 
the average, leading to an unbalanced level set. This is not surprising, since the algorithm 
does not explicitly consider balancing and only local information is used, i.e. whether a 
vertex is adjacent to a level or not. An example for this imbalance is shown in Fig. [4j 
the BFS construction yields a very large last level. 

Note that throughout the manuscript we visualize the graph theoretical concepts using 
examples of graphs obtained from discretizing a two-dimensional structure. However, 
the ideas and algorithms presented here apply to any graph and are not limited to 
graphs with coordinate information. Two-dimensional graphs have the advantage of being 



9 




visualized easily. In particular, the BFS search has an intuitive physical analog: Wave 
front propagation of elementary waves emanating from the vertices of the initial level Vo- 
The problem that a BFS does not yield a balanced partitioning was also noted in the 
theory of bandwidth reduction. The Gibbs-Poole-Stockmeyer (GPS) algorithm tries to 
overcome this deficiency by constructing a level set through the combination of two BFS 
searches starting from the initial and the final levels. However there the initial and final 
levels are sought to be furthest apart, contrary to our problem. In general, the GPS 
construction only yields a balanced level set if the initial and final level are close to 
furthest apart, as we will show in Sec.[3J 



2.2.2. A global approach — recursive bisection 

In order to obtain a balanced partitioning, graph partitioning algorithms commonly 
perform a recursive bisection, i.e. successively bisect the graph and the resulting parts 
until the desired number of parts is obtained [4l|, 0, 47 , [5^, Hj|. This approach 



has the advantage of reducing the partitioning problem to a simpler one, i.e. bisection. 
Furthermore, if the resulting parts of every bisection are equally sized, the overall par- 
titioning will be balanced. In addition, bisection is inherently a global approach, as the 
whole graph must be considered for splitting the system into two equally sized parts. 
Thus, it can be expected to yield better results than a local approach, such as BFS. 

We intent to construct a level set with N + 2 levels, where N + 2 is the maximum 
number of levels as determined by algorithm [1] To this end we start from an initial parti- 
tioning {Vo, Vi, Vn+i}, where Vo and Vn+i contain the vertices of the leads (requirement 
I2.2lij) . and Vi all other vertices. The level set is then obtained by applying the bisection 
algorithm recursively to Vi and the resulting subsets, until N levels are obtained, as 
shown schematically in Fig. [51 Here bisection means splitting a set Vi into two sets, V^ 
and Vi 2 , such that Vi 1 U Vi 2 = Vi and V^ n Vi 2 = 0. In oder to be applicable to the graph 
partitioning problem 12.21 the bisection must comply with certain requirements: 
Problem 2.5. The bisection algorithm must be 

(i) compatible with a level set with N + 2 levels. 

(ii) balanced. 

(iii) performed such that subsequent bisections may lead to a balanced level set. 
Requirement I2.5liiil is formulated rather vaguely: Usually there are many different 

choices how to perform a bisection. A particular choice will influence the subsequent 
bisections (for a similar problem in graph partitioning see [531] ) , and thus the bisection 
algorithm must in principle take all following bisection steps into account. Since an exact 
solution to that problem seems computationally intractable, we will resort to heuristics 
there. 

We start explaining how to comply with requirements 12.5111 and I2.5liil In the following 
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BFS from right 



Fig. 6. (a) Example showing for a disk-type geometry the BFS from the left and right neighboring 
sets that construct the minimal set of vertices V^bfs (black) and Vi 2i BFS (dark grey) that must be 
contained in Vi ± and Vi 2 , respectively. The remaining vertices (light grey) can be assigned to either set. 
(b) and (c): Examples illustrating the difference between cut edges and cut nets. The number of cut 
edges is 5 in both (b) and (c), while the number of cut nets (boundary vertices) is 10 in (b) and 9 in (c). 

we assume that N > 0, as iV = —1,0 are trivial cases. Then the initial partitioning 
{Vo, Vi, Vjv-fi} forms a level set, and so will the final result of the recursive bisection, 
if the result of every intermediate bisection yields a level set. For this, consider a set 
Vi with vertices adjacent to the sets Vj left and Vi r , ht) where "left" ( "right" ) is defined as 
being closer to Vo (Viv+i). Then the sets resulting from the bisection, Vi 1 and Vi 2 may 
only have vertices adjacent to Vj Jeft ,Vi 2 and V^ ,Vi right , respectively. 

Apart from the condition of forming a level set, requirement 12 . 5(11 also dictates the total 
number of levels. Due to the nature of the recursive bisection, the number of final levels 
contained in an intermediate step is always well-defined. If a set Vi contains Ni levels, then 
Vsj and Vi 2 must contain = Int(iVj/2) and Ni 2 = Ni — Int(iV,/2) levels, respectively. 
Here, Int(. . . ) denotes rounding off to the next smallest integer. The bisection is thus 
balanced, if 

|ViJ*^|V<| and |V i2 |«^|Vi|. (15) 

Note that Ni can take any value, and usually is not a power of two. 

From Lemma 12.41 we know that the minimum set of vertices necessary to form n levels 
is given by a BFS up to level n. Let V^bfs (Vj 2i BFs) denote the set of vertices found by a 
BFS starting from Vi lcft (Vj r , M ) up to level Ni 1 (AT, 2 ). Then, for any bisection complying 
with requirement I2.5lil V^bfs C V^ and Vi 2i bfs C Vj 2 . These vertices are uniquely 
assigned to Vi t and Vi 2 and are consequently marked as locked, i.e. later operations may 
not change this assignment. An example for the vertices found in a BFS is shown in 
Fig. EJa). Note that in the initial bisection, Vi = Vi, Ni = N, Vi loft = Vo, and Vj J<jft = 
Vjv+i. 

The remaining unassigned vertices can be assigned to either set, and the bisection 
will still be compatible with a level set containing N + 2 vertices. Thus for complying 
with requirement I2.5liil any prescription obeying the balancing criterion may be used. 
We choose to distribute the remaining vertices by continuing the BFS from Vj lcft and 
Vj ri ht and assigning vertices to V^ and Vi 2 depending on their distance to the left or 
right neighboring set, while additionally obeying the balancing criterion. This approach — 
assigning vertices to levels according to their distance from the initial and final set — is 
rather intuitive and probably the procedure that would be used if the level set were to 
be constructed "by hand" . This procedure may lead to reasonable level sets, however in 
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general, additional optimization on the sets and Vi 2 is needed, as discussed below. 
If this optimization is used, it can also be useful to distribute the unassigned vertices 
randomly, as this may help avoiding local minima. 

As mentioned above, there is a lot of arbitrariness in distributing the unassigned ver- 
tices into V-jj and Vi 2 . However, the particular choice of the bisection will influence 
whether a later bisection is balanced or not: If V^^^bfs contains more vertices than 
the balance criterion (fT5|) . the bisection cannot be balanced. Obviously, the BFS that 
constructs Vj^j^BFS depends on the details of the set Vi and thus on the details of the 
previous bisection step. 

In order to formulate a criterion that may resolve the above mentioned arbitrariness 
and help to find a balanced level set, it is useful to consider the matrix representation of 
the graph Q. Bisecting a graph means ordering the corresponding matrix into two blocks 
that are connected by an off-diagonal matrix H^ ^: 
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(16) 



This off-diagonal matrix will be unchanged by further bisections and thus determines the 
minimum level width that can be achieved. Therefore, the size of the off-diagonal matrix 
Hi lt i 2 should be minimized. 

In a bisection, an edge (vi, v%) € £ is said to be cut, if v\ and belong to different sets, 
i.e. v\ G Vi t and «2 € Vi 2 or vice versa. The entries of Hi lt i 2 correspond to edges cut by 
the bisection, and minimizing the number of entries in j 2 corresponds to minimizing 
the number of edges cut by the bisection (min-cut criterion). This criterion is often 
used in reordering matrices for parallel processing, where the off-diagonal matrix size 
determines the amount of communication between processors. 

However, the number of entries in Hi Xl i 2 is not directly related to the size of the matrix, 
as has been noted in the graph partitioning problem for parallel computing (48j . Instead, 
the size of the off-diagonal matrix is given by the number of surface vertices, i.e. the 
number of vertices that have cut edges. For this, we define a net of a vertex v in a graph 
Q = (V,£) as (HUli 

net(w) = {u e V\u is adjacent to v} . (17) 

Note that v S net(v), as v is adjacent to itself. A net is said to be cut by a bisection, 
if any two vertices v±,V2 € net(i>) are contained in different sets and Vi 2 . Then, the 
number of surface vertices and thus the size of the off-diagonal matrix Hi lt i 2 is given by 
the number of cut nets. Thus, minimizing the number of cut nets {min-net-cut criterion) 
corresponds to minimizing the the number of surface vertices, and thus to minimizing 
the size of the off-diagonal matrix Hi lt i 2 . Furthermore, since the vertices in Vi 1/2i BFS are 
determined by a BFS emanating from the surface vertices, minimizing the number of 
cut nets will usually also lead to a smaller number of vertices in Vi 1/2! BFSj leaving more 
freedom towards achieving a balanced bisection. Figs.^b) and (c) show a comparison of 
the min-cut and min-net-cut criterion for simple examples. In practice, when minimizing 
the number of cut nets, we also use the min-cut criterion to break ties between different 
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bisections with the same number of cut nets (min-net-cut-min-cut criterion) in order to 
avoid wide local minima, that occur frequently in the min-net-cut problem. 

Both the min-cut and min-net-cut bisection problem have been shown to be MV- 
hard [54(. Therefore, only heuristics are available to solve them. These heuristics start 
from an initial (balanced) bisection, such as constructed by the steps outlined above, 
and improve upon this initial bisection. Bere, we choose to use the Fiduccia-Mattheyses 
(FM) algorithm [42), as it is readily available for min-cut and min-net-cut bisection. In 
fact, min-net-cut bisection is a hypergraph partitioning problem, and the FM algorithm 
was originally designed for hypergraph partitioning. Furthermore, the FM algorithm can 
naturally deal with locked vertices that may not be moved between sets, is reasonable 
fast and its underlying concepts are easy to understand. The FM heuristic is a pass-based 
technique, i.e. it is applied repeatedly to the problem (several passes are performed), 
iteratively improving the bisection. More detailed information about the fundamentals 
of the Fiduccia-Mattheyses algorithm are given in appendix [A"l 

We now summarize the steps outlined above and formulate an algorithm for bisection: 
Algorithm 2. Bisection of set V% containing N$ levels, with left (right) neighboring set 

A Stop, if JVj = 1. 

B Do a BFS starting from Vj Jeft up to level = Int(iVj/2) and a BFS starting from 
Vj right up to level N i2 = Af_Int(iVj/2). The vertices found by the BFS are assigned 
to Vjj and Vi 2 , respectively, and are marked as locked. 

C Distribute the remaining unassigned vertices taking into account the balance cri- 
terion (fTS)) . The vertices may be assigned according to either one of the following 
prescriptions: 

a) Continue the BFSs from step B and assign vertices to V< 15 if they are first 
reached by the BFS from Vi lctt , and to Vi 2 , if they are first reached by the BFS 
from Vi right . If a set has reached the size given by the balance criterion, assign 
all remaining vertices to the other set. 

b) Distribute the unassigned vertices randomly to Vi 1 and Vi 2 . If a set has reached 
the size given by the balance criterion, assign all remaining vertices to the other 
set. 

D Optimize the sets Vjj and Vi 2 by changing the assignment of unlocked vertices 
according to some minimization criterion. In particular, the following optimizations 
may be performed: 

a) No optimization. 

b) Min-cut optimization using the FM algorithm. 

c) Min-net-cut optimization using the FM algorithm. 

d) Min-net-cut-min-cut optimization using the FM algorithm. 

Recursive application of the bisection algorithm [2] then leads to an algorithm for con- 
structing a level set complying with the requirements of the graph partitioning problem 
12.21 and thus an algorithm for block-tridiagonalizing a matrix. 
Algorithm 3. Block-tridiagonalization of matrix H 

A Construct the graph Q = (V, £) corresponding to the matrix H, and the sets Vo 

and V/v+i corresponding to the leads. 
B Use algorithm [1] to determine the maximum number of levels N + 2. If N < 1, stop. 
C Construct Vi = V\ (Vo U VV+i), containing N levels. 

D Apply the bisection algorithm [2] to V\ and then recursively on the resulting subsets. 
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Fig. 7. Examples of level sets arising from (a) the natural ordering of grid points (as in Fig. [TJ, and 
application of the block-tridiagonalization algorithm [3] with distribution of vertices by BFS (algorithm 
[2] step C.(a)) (b) without further optimization, (c) with min-cut optimization, (d) with min-net-cut 
optimization. 

Do not further apply if a set only contains one level. 
It should be emphasized, that the block-tridiagonalization does not require any other 
input than the graph structure. In principle, the number of FM passes may affect the 
result. However, from ex per ience, this number can be chosen as a fixed value, e.g. 10 FM 
passes, for all situations [42|. Thus, the block-tridiagonalization algorithm can serve as a 
black box. 

In Fig. [7] we show for examples of level sets arising from the natural ordering of grid 
points (Fig. [T^a), natural level set) and from the block-tridiagonalization algorithm de- 
veloped in this work (Fig. EJb)-(d)) for the case of a disk-type geometry. The level set in 
Fig.[7Jb) arises from recursive bisection, where the vertices were distributed according to 
a BFS without any optimization. The resulting level set strongly resembles the natural 
level set. This is due to the highly symmetric structure and the fact that vertices are 
assigned to levels according to their distance from the leads — only small deviations are 
present due to the balance criterion. When the bisection is optimized according to the 
min-cut criterion, Fig. [Tfc), the resulting level set changes little, as the min-cut crite- 
rion favors horizontal and vertical cuts for a square lattice, as presented in the example. 
In contrast, min-net-cut optimization (Fig. [T^d)) yields a new, non-trivial level set that 
has less symmetry than the underlying structure. Note that the minimization of surface 
vertices leads to levels in the form of "droplets" , analogous to surface tension in liquids. 

In fact, we will show in Sec. [3] that min-net-cut optimization usually leads to level 
sets and thus block-tridiagonal orderings that are superior to those arising from other 
methods. In particular, they are better than the natural level sets, leading to a significant 
speed-up of transport algorithms, as demonstrated in Sec. 13.11 In addition to that, the 
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reordering algorithms allow one to use conventional two-terminal transport algorithms 
also for more complicated, even multi-terminal structures (see Sees. [37T1 and |3~2)) . 

2.2.3. Computational complexity 

We conclude the theoretical considerations with an analysis of the computational com- 
plexity of algorithms [2] and [3l 

The bisection algorithm involves a BFS search on Vi, which scales linearly with the 
number of edges within Vj, and thus has complexity 0(|£;|), where £j is the set of edges 
within Vi. In addition to that, a single optimization pass of the FM algorithm scales also 
as (4^]. Usually, a constant number of passes independent of the size of the graph 

is enough to obtain converged results, and therefore the optimization process of several 
FM passes is also considered to scale as (9(|£j|). Thus, the full bisection algorithm also 
has complexity 0(|£j|). 

Usually, the number of edges per vertex is approximately homogeneous throughout the 
graph. Since the recursive bisection is a divide- and- conquer approach, the computational 
complexity of the full block-tridiagonalization algorithm is then 0(|£|log|£|) [5l| . In 
typical graphs arising from physics problems, the number of edges per vertex is a constant, 
the computational complexity can also be written as 0(N SI n logiVg r id), where JV gr id is 
the number of vertices in V, or the size of the matrix H . 

In contrast, many quantum transport algorithms, such as the recursive Green's function 
technique, scale as 0(N(N grid /N) 3 ) = 0(N 3 rid /N 2 ) in the optimal case of N equally 
sized matrix blocks (levels) of size Ng^/N. Often, the number of blocks (levels) N oc 
-^grid- Typically, to name a few examples, a = 1 in onc-dimensional chains, a = 1/2 
in two dimensions, and the transport calculation scales as 0(N^ A a ). Thus, except for 
the case of a linear chain, where N = -/V gr id and matrix reordering is pointless anyways, 
the block-tridiagonalization algorithm always scales more favorably than the quantum 
transport algorithms. This scaling implies that the overhead of the matrix reordering in 
the transport calculation will become more negligible, the larger the system size. 

3. Examples: Charge transport in two-dimensional systems 

3.1. Ballistic transport in two-terminal devices 

We now evaluate the performance of the block-tridiagonalization algorithm using rep- 
resentative examples from mesoscopic physics. The Schrodinger equation for the two- 
dimensional electron gas (2DEG) is usually transformed into a tight-binding problem by 
the method of finite differences 0, @, 0], where the continuous differential equation is 
replaced by a set of linear equations involving only the values of the wave function on 
discrete grid points. Commonly, these points are arranged in a regular, square grid. This 
grid, together with the shape of the particular structure under consideration then defines 
the structure of the Hamilton matrix and the corresponding graph. 

The representative examples considered here are shown in Fig. [8] The circle (Fig. [51(a)) 
and the asymmetric Sinai billiard (Fig.[5Jb)) that are examples of integrable and chaotic 
billiards in quantum chaos, the ring (Fig. EJc)) that may exhibit various interference 
physics, and the circular cavity with leads that are not parallel (Fig.[8jd)) as an example 
of a structure that does not have an intuitive, natural block-tridiagonal ordering. For 
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Fig. 8. Typical examples of structures considered in two-dimensional mesoscopic systems: (a) circle 
billiard, (b) asymmetric Sinai billiard, (c) ring, and (d) circular cavity with perpendicular leads. The 
tight-binding grid arises from the finite difference approximation to the Schrodinger equation. Note that 
the number of grid points used here was deliberately chosen very small for visualization purposes, fn a 
real calculation, the number of grid points would be at least 2 orders of magnitude larger, dcxtcnt denotes 
a length characterizing the extent for the different structures. 

all these structures, we introduce a length scale d cx tont, given by the outer radius of 
the circular structures and the side length of the square structure, characterizing the 
maximum extent. The fineness of the grid, and thus the size of the corresponding graph 
will be measured in number of grid points per length d ex tent- 

We now apply the block-tridiagonalization algorithm using the various optimization 
criteria discussed in the previous section, and compare the resulting orderings with the 
natural ordering and the ordering generated by the GPS algorithm. The weights w(H), 
Eq. (fTi|) , of the different orderings are given in Table [T] 

The initial distributions for the bisection algorithm are done in two different ways: The 
vertices are distributed both in an ordered way — by BFS — and randomly. The outcome 
after the optimization however is always similar for both types of initial distributions 
which indicates that the resulting weights are close to the global minimum and not stuck 
in a local minimum. Note that we use twice as many FM passes for a random initial 
distribution than for an initial distribution by BFS, as convergence is usually slower for 
a random initial distribution. 

In all examples, the min-net-cut criterion yields orderings with the best weights, as 
expected from the considerations of the previous section. Based on the weight, order- 
ings according to this criterion are expected to give the best performance in transport 
calculations such as the RGF algorithm. Note that the min-net-cut-min-cut ordering 
is on average closest to the best ordering. The min-net-cut ordering sometimes suffers 
from slow convergence, when the algorithm must traverse a wide local minimum. The 
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Table 1 

Weights w(H), Eq. 114> . for the block-tridiagonal ordering constructed by different algorithms for the 
examples of Fig. [5J. Optimization was done by 10 passes of the FM algorithm, when the initial bisection 
was constructed by BFS (algorithm [2] step C.(a)), and 20 passes, when the initial bisection was con- 
structed by a random distribution of vertices (algorithm [2] step C.(b)). The minimal weights for each 
system are printed bold. In all examples, there were 400 grid points per length e£ cx tcnt- 





Circular billiard 


Asymmetric 
Sinai billiard 


Ring 


Cavity with 
perp. leads 


natural block- 
tridiagonal ordering 


1.51 X 10 10 


1.58 x 10 10 


8.72 x 10 8 




Gibbs-Poolc- 
Stockmeyer 


1.15 X 10 12 


7.84 x 10 11 


2.14 x 10 8 


7.05 x 10 12 


distribution by BFS, 

11U yJ\-) LlIlllZjClLlUll 


1.51 x 10 10 


9.29 x 10 9 


2.1 x 10 8 


1.69 x 10 10 


distribution by BFS, 
min-cut 


1.51 x 10 10 


9.67 x 10 9 


2.1 x 10 8 


1.59 x 10 10 


random distribution, 
min-cut 


2.22 x 10 10 


9.95 x 10 9 


2.1 x 10 8 


5.13 x 10 10 


distribution by BFS, 
min-net-cut 


1.51 x 10 10 


9.46 x 10 9 


2.1 x 10 8 


1.18 x 10 10 


random distribution, 
min-net-cut 


1.46 x 10 10 


9.0 x 10 9 


2.09 x 10 8 


1.18 x 10 10 


distribution by BFS, 
min-net-cut-min-cut 


1.26 x 10 10 


9.28 x 10 9 


2.08 X 10 s 


1.24 x 10 10 


random distribution, 
min-net-cut-min-cut 


1.27 x 10 10 


9.16 x 10 9 


2.09 x 10 8 


2.02 x 10 10 



additional min-cut criterion helps to break ties and thus avoids these wide local minima. 

Except for the ring, where all algorithms perform well, the GPS algorithm yields 
weights that are even larger than the weight of the natural ordering. As discussed above, 
the GPS algorithms performs well, if both leads are furthest apart in terms of the graph. 
In the case of the ring, this is approximately fulfilled. In the general case, when the leads 
are at arbitrary positions, the GPS algorithm usually produces some very large levels. 
As the level size enters cubically in the w^H), this results in a prohibitively large weight. 
The GPS algorithm thus cannot be used as a generic reordering algorithm for quantum 
transport according to problem 12.21 

In summary, the block-tridiagonalization algorithm [3] in the combination of initial 
distribution by BFS and min-net-cut-min-cut optimization yields the best reorderings 
with respect to the weight w(H). Experience shows that usually 10 FM passes are enough 
for optimizing a bisection. As a consequence, we will use this combination exclusively in 
the rest of this work. 

The weight w(H) of a matrix is a global measure of the quality of a ordering. Additional 
insight can be gained from the distribution of the sizes Mj of the matrix blocks/levels. 
In Fig. [9] we show this distribution before and after reordering. For the natural ordering 
of the finite difference grids, the number of matrix blocks is determined by the number 
of lattice points along the x-coordinate direction (see Fig.QJb)). In contrast, the number 
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Fig. 9. Level (matrix block) size Af, as a function of the level (matrix block) index i for the natural level 
set (dashed line) and the min-net-cut-min-cut reordering (solid line), shown for (a) the circle billiard, 
(b) the asymmetric Sinai billiard, (c) the ring, and (d) the circular cavity with perpendicular leads. Note 
that for (d), there is no natural ordering. In all examples, there were 400 grid points per length d ex tent- 



of matrix blocks after reordering is given by the length of the shortest path between the 
two leads, in terms of the corresponding graph. 

In the case of the circle billiard, Fig. [U[a), the number of matrix blocks is the same for 
the natural ordering and the reordered matrix, as the shortest path between the leads is 
simply a straight line along the ^-coordinate direction. The improvements in the weight 
originate only from balancing the matrix block sizes: While the matrix block sizes vary 
for the natural ordering — the lateral size changes along the ^-direction — the reordered 
matrix has equally sized matrix blocks. For this particular example, the result of the 
block-tridiagonalization algorithm is optimal, as it yields the best solution with respect 
to the requirements set forth in problems 12.11 and 12.21 Note that in general it is not 
always possible to find a perfectly balanced partitioning, but the circle billiard is such 
an example. 

In contrast, in the case of the asymmetric Sinai billiard and the ring the number of 
matrix blocks generated by the block-tridiagonalization algorithm is larger than in the 
natural ordering (see Figs.[^b) and (c), respectively). In both cases, the obstacle within 
the scattering region increases the length of the shortest path connecting the two leads. 
In both examples, this increase in the number of matrix blocks leads to a significantly 
decreased weight w(H) with respect to the natural ordering, although the partitioning is 
only approximately balanced. For instance, in the particular case of the ring, the number 
of matrix blocks after reordering is approximately given by the number of lattice points 
around half of the circumference. The reordered ring thus has a weight very similar to 
a straight wire with a width twice as large as the width of one arm of the ring, and a 
length given by half of the ring circumference. 
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Fig. 10. (a)— (c): relative gain in computational time r cpu _ti m c, Eq- 118> . through the reordering as a 
function of the grid size for the circular billiard, the asymmetric Sinai billiard, and the ring, respectively, 
^cpu-time is shown excluding (□) and including (Q) the overhead of matrix reordering. The estimate 
for '"cpu-timc from the weights w(H) of the different orderings is shown as a dashed line, (d): fraction 
of time r matr ; x reordering i Eq. (119b . used for reordering the matrix as a function of the grid size. Data 
is shown for the circular billiard (□), the asymmetric Sinai billiard (O)i the ring (A), and the circular 
cavity with perpendicular leads (+). The benchmarks were run on Pentium 4 processor with 2.8 GHz 
and 2 GBs of memory. 

For the cavity with perpendicular leads, there is no natural ordering, and a specialized 
transport algorithm would be required. The reordering creates a matrix with approxi- 
mately balanced block sizes, and allows the direct application of conventional algorithms. 

The weight w(H) was introduced as a theoretical concept in order to simulate the 
computational complexity of a transport calculation. After discussing the influence of the 
reordering on this theoretical concept, we now demonstrate how the reordering increases 
the performance of an actual quantum transport calculation. 

To this end we use a straight-forward implementation of the well-established recursive 
Green's function algorithm for two terminals, as described in Ref. 29]. The necessary 
linear algebra operations are performed using the ATLAS implementation of LAPACK 



and BLAS [55l |56|. optimized for specific processors. It should be emphasized that the 
code that does the actual transport calculation — such as calculation of the Green's func- 
tion and evaluation of the Fisher-Lee relation — is the same for all examples considered 
here, including the non-trivial cavity with perpendicular leads. The abstraction necessary 
for the reordering, i.e. the graph structure and the corresponding level set, allows for a 
generic computational code applicable to any tight-binding model. 
We measure the performance gain through matrix reordering as 

computing time for natural ordering 

^cpu-timc — 7"; 7"- r i i ] • ■ 

computing time tor reordered matrix 



(18) 
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Note that during a real calculation, the conductance is usually not only calculated once, 
but repeatedly as a function of some parameters, such as Fermi energy or magnetic 
field. Thus, the respective quantum transport algorithm is executed repeatedly, too. In 
contrast, the block-tridiagonalization has to be carried out again only when the structure 
of the matrix and thus the corresponding graph changes. For the examples considered 
here this would correspond to changing the grid spacing or the shape of the structure. In 
such a case, the overhead of matrix reordering must be taken into account for r cpu _ t i mc . 
This overhead can be quantified as 

overhead of matrix reordering 

^matrix reordering — ~. 7": ■ i 7- i ■ ' V ^/ 

computing time including reordering 

In a typical calculation however, the matrix structure given by the underlying tight- 
binding grid does not change, and the matrix reordering must be carried out only once. In 
this common situation, the overhead of matrix reordering is negligible. For example, any 
change of physical parameters such as Fermi energy, magnetic field or disorder averages 
does not change the matrix structure. 

In Fig. [TU] we show the performance gain through matrix reordering, rcpu.timc, as a 
function of grid size for the circle billiard, the asymmetric Sinai billiard, and the ring 
(Figs. [TOT aWc). respectively). We include both measurements excluding and including 
the overhead of matrix reordering, as discussed above. Remember that in the case of 
the cavity with perpendicular leads, Fig. [8jd), there is no natural ordering and thus a 
performance comparison is not possible. In fact for this system, only matrix reordering 
makes a transport calculation possible in the first place. 

We find that block-tridiagonalization always increases the algorithmic performance in 
the typical situation, when the overhead of matrix reordering can be neglected. However, 
even if the reordering overhead is taken into account, we see a significant performance 
gain except for small systems — but there the total computing time is very short anyway. 
In fact, as the system sizes increases, the overhead of reordering becomes negligible, 
as predicted from the analysis of the computational complexity, and the performance 
gains including and excluding the reordering overhead converge. This can also be seen 
in Fig. HOf d). where we show the reordering overhead r ma t r i x reordering as a function of 
system size. 

Especially for large systems, the total computing time can become very long, and any 
performance gain is beneficial. Reordering leads to significant performance gains up to a 
factor of 3 in the case of the ring. The performance gain r cpu _ti m e can also be estimated 
from the weights w(H) of the original matrix (the natural ordering) and the reordered 
matrix, shown as the dashed line in Figs. fTOT aWc). The actual, measured performance 
gain approaches this theoretical value, as the system size increases. Note that we do 
not fully reach the theoretically predicted performance gain in the case of the ring. On 
modern computer architectures, computing time does not only depend on the number 
of arithmetic operations 56], and thus the weight w(H) overestimates the performance 
gain, though the performance still improves significantly. 

Finally, we demonstrate the 0(-/V gr id log A^id) scaling of the reordering algorithm. 
Fig. [11] shows the computing times of the block-tridiagonalization algorithm as a func- 
tion of matrix/graph size N for the geometries considered in this section. For all sys- 
tems, the computing times scale according to the prediction from the complexity anal- 
ysis in Sec. 12.2.31 as apparent from the fit oc 7V gr id log -/V gr id . Note that for large iV gr id, 
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Fig. 11. Time spent for matrix reordering as a function of the total grid (matrix) size N, for the circular 
billiard (□), the asymmetric Sinai billiard (O)i the ring (A), and the circular cavity with perpendicular 
leads (+). The solid line is a fit to the predicted scaling of the computational complexity, TV log TV. 



C(Agrid log-/Vg r id) scaling is practically indistinguishable from 0(iVg r id), as can also be 
seen in Fig. [TTJ 

In the examples of this section, we considered the pedagogic case of charge transport 
on a square, finite difference grid. The approach presented here can however immedi ately 



applied to more complex situations, such as spin transport, as reviewed in Ref. 57 1. 
In addition, extending the transport calculation to a different grid is straightforward, 
as any tight-binding grid can be encoded into a graph, and the block-tridiagonalization 
algorithm has already been applied to the case of the hexagonal grid of graphene [58| (for 
a review on graphene see (59l|). A further example of this versatility is shown in the next 
section, where we apply the block-tridiagonalization algorithm to solve multi-terminal 
structures involving different tight-binding models. 



3.2. Multi-terminal structures 



In the previous section, we demonstrated that matrix reordering increases the per- 
formance of quantum transport algorithms for two-terminal structures and addition- 
ally makes it possible to apply these conventional algorithms to non-trivial structures. 
Whereas there is a great variety of quantum transport algorithms for systems with two 
leads, there are only few algorithms that are suitable for multi-terminal structures, and 
most of these are restricted to rather specific geometries (e.g. Ref. 30}). Only recently 
algorithms have been developed that claim to be applicable to any multi-terminal struc- 
ture. The knitting algorithm of Ref. [6c| is a variant of the RGF algorithm where the 
system is built up adding every lattice point individually, instead of adding whole blocks 
of lattice points at a time. Therefore, instead of a matrix multiplication, the central 
computational step is an exterior product of vectors. Unfortunately, this implies that the 
knitting algorithm cannot use highly optimized matrix multiplication routines (Level 3 
BLAS operations), that are usually much more efficient than their vector counterparts 
(Level 2 BLAS operations), as discussed in Ref. (56J. Another multi-terminal transport 
algorithm presented recently 61], is based on the transfer matrix approach. However, it 
requires the Hamiltonian to be in a specific block-tridiagonal form, and the corresponding 
level set is set up manually. 
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Fig. 12. A multi-terminal structure can be reduced to an equivalent two-terminal structure by collecting 
all leads in two virtual leads, (a) The leads are redistributed into two virtual leads, (b) All leads are 
combined in a single virtual lead, the second virtual lead is formed by a vertex furthest away. 
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Fig. 13. Example of a four-terminal calculation: Quantum hall effect (a) in a two-dimensional electron 
gas and (b) in graphene. The Hall resistance Rn is shown as a function of W/l cyc i, where W is the width 
of the Hall bar and Z cyc i the cyclotron radius in a magnetic field B. Note that W/l cyc \ oc B. The dotted 



lines indicate the quantized values of the Hall resistance, h/2e 



where n is a positive integer. 



Here we show how to employ the block-tridiagonalization algorithm in order to apply 
the well-established two-terminal quantum transport algorithms to an arbitrary multi- 
terminal system. The basic idea is sketched in Fig.fT^a): Combining several (real) leads 
into only two virtual leads the multi-terminal problem is reduced to an equivalent two- 
terminal problem. After reordering, the resulting problem can then be solved by conven- 
tional two-terminal algorithms. Note that in this approach the number of matrix blocks 
is given by the shortest path between leads in two different virtual leads. If all leads are 
very close together, this may lead to only few, large blocks in the reordered matrix and 
respectively levels in the graph partitioning, leading to a very large weight w(H). In such 
a case it is advisable to collect all leads into a single virtual lead. The second virtual lead 
is then formed by a vertex in the graph, that is furthest away from all leads as depicted 
in Fig. rT27b). Such a vertex can be found by a BFS search originating from all leads. 
Thereby the number of matrix blocks/levels is maximized. In fact, this a ppr oach yields 
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a block-tridiagonal matrix structure as required by the algorithm of Ref. 

We now demonstrate these strategies on the example of the quantum Hall effect (QHE) 
in a 2DEG formed in a semiconductor heterostructure [Hit ] and in graphene (r33l | . For this 
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we use a four-terminal Hall bar geometry as sketched in Fig. I12f a). on top of a square 
lattice (finite difference approximation to 2DEG) and a hexagonal lattice . Again, it 
should be emphasized that the code of the actual transport calculation is the same 
as employed in the two-terminal examples of the previous section. The results of the 
calculation are shown in Fig. Q2J where the integer QHE of the 2DEG and the odd- 
integer QHE of graphene are clearly visible. 

The methods outlined above make it possible to calculate quantum transport in any 
system described by a tight-binding Hamiltonian. This generality is one of the main 
advantages gained by using the matrix reordering. However, generality also implies that 
it is difficult to make use of properties of specific systems, such as symmetries, in order 
to speed up calculations. Special algorithms developed specifically for a certain system 
however can, and will usually be faster than a generic approach — at the cost of additional 
development time. 

In the case of the Hall geometry in a 2DEG, such a special algorithm was presented by 
Baranger et al. [30j | . and we have implemented a variant of it. Comparing the computing 
times for the Hall bar geometry in a 2DEG, we find that the special algorithm is only a 
factor of 1.6—1.7 faster than our generic approach. Although such a performance compar- 
ison may depend crucially on the details of the system under consideration, experience 
shows that the use of the generic approach often does not come with a big performance 
penalty. 



4. Conclusions 



We have developed a block-tridiagonalization algorithm based on graph partitioning 
techniques that can serve as a preconditioning step for a wide class of quantum transport 
algorithms. The algorithm can be applied to any Hamilton matrix originating from an 
arbitrary tight-binding formulation and brings this matrix into a form that is more 
suitable for many two-terminal quantum transport algorithms, such as the widely used 
recursive Green's function algorithm. The advantages of this reordering are twofold: First, 
the reordering can speed up the transport calculation significantly. Second, it allows for 
applying conventional two-terminal algorithms to non-trivial geometries including non- 
collinear leads and multi-terminal systems. The block-tridiagonalization algorithm scales 
as C(-/Vg r id log Agrid), where -/V gr id is the size of the Hamilton matrix, and thus induces 
only little additional overhead. We have demonstrated the performance of the matrix 
reordering on representative examples, including transport in 2DEGs and graphene. 

The block-tridiagonalization algorithm can operate as a black box and serve as the 
foundation of a generic transport code that can be applied to arbitrary tight-binding 
systems. Such a generic transport code is desirable, as it minimizes development time 
and increases code quality, as only few basic transport routines are necessary, that can 
be tested thoroughly. 

We acknowledge financial support from DFG within GRK638 and SFB689. 
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a) b) 




Fig. A.l. Schematic representation of (a) a simple graph and (b) the corresponding hypergraph structure 
imposed through all nets, M = {net(n) | v £ V}. 

Appendix A. The Fiduccia-Mattheyses algorithm 

A.l. Graphs and hypergraphs 

The Fiduccia-Mattheyses algorithm was originally developed for hypergraph partition- 
ing (42J. A hypergraph Ti. is an ordered pair H = (V,7V), where V is a set of vertices, and 
J\f a set of nets n.; (also called hyperedges) between them. A net is a set of vertices, 
i.e. ni C V. An undirected graph is a special realization of a hypergraph, where every 
net contains exactly two vertices. Thus, any algorithm for hypergraph partitioning can 
also be applied to an undirected graph. 

During the FM bisection, we have to consider the graph structure arising from the 
Hamiltonian matrix in order to minimize the number of cut edges (min-cut), whereas 
for minimizing the number of surface vertices, i.e. the number of cut nets (min-net- 
cut), the hypergraph structure arising from all nets net(u) as defined in Eq. (TTT|) . N = 
{net(w) I v € V}, is essential. For min-net-cut- min-cut optimization, we have to consider 
both structures simultaneously A schematic representation of a graph and the corre- 
sponding hypergraph structure is shown in Fig. IA.1I 

A. 2. Fiduccia-Mattheyses bisection 

The FM algorithm is based on the concept of gain. The gain of a vertex in an existing 
bisection is defined as the change in weight, i.e. the number of cut edges or nets, that 
occurs when this vertex is move to the other part. This gain can also be negative, if such a 
move increases the number of cut edges or nets. The basic idea of the FM algorithm is to 
swap vertices with the highest gain between parts, while obeying some balance criterion. 
The fact that the highest gain can be negative, helps the FM algorithm to escape local 
minima. After moving, the respective vertex is locked in order to avoid an infinite loop, 
where a single vertex might be swapped back and forth repeatedly. The FM pass ends, 
when all (free) vertices have been moved, and the best bisection encountered during the 
pass is returned as result. Further passes can then successively improve on this bisection. 
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